March 18, 2019
correlation: degree of association or relationship between the observed values taken by two variables (\(X\) and \(Y\))
(Pearson) correlation: also has specific mathematical definition (you don't need to know it):
\[r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}\sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}\]
This captures extent to which deviations from mean of \(X\) move with deviations from mean of \(Y\).
mathematically: correlation is the degree of linear association between \(X\) and \(Y\)
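As a minimal sketch, the formula for \(r\) can be computed directly from its definition. The function name `pearson_r` and the toy data are illustrative assumptions, not part of the lecture.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation, computed term by term from the definition."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least 2 values")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator: sum of products of paired deviations from the means.
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: product of the root sums of squared deviations.
    den = (math.sqrt(sum((x - x_bar) ** 2 for x in xs))
           * math.sqrt(sum((y - y_bar) ** 2 for y in ys)))
    if den == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return num / den

x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]    # rises with x
y_down = [10, 8, 6, 4, 2]  # falls as x rises

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
print(r_up, r_down)
```

When deviations of \(X\) and \(Y\) from their means share the same sign, the products in the numerator are positive and \(r\) is positive; when they have opposite signs, \(r\) is negative.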
negative correlation: (correlation \(< 0\)) values of \(X\) and \(Y\) move in opposite direction:
positive correlation: (correlation \(> 0\)) values of \(X\) and \(Y\) move in same direction:
It is possible to see perfect correlation but small change in \(Y\) across \(X\)
It is possible to see low correlation but large change in \(Y\) across \(X\)
It is possible to see perfect nonlinear relationship between \(X\) and \(Y\) with \(0\) correlation
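The three cases above can be checked on small hand-made data sets. This is only a sketch: the helper names `pearson_r` and `ols_slope`, the use of a least-squares line to measure "change in \(Y\) across \(X\)", and the numbers themselves are assumptions chosen for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation from the definition."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = (math.sqrt(sum((x - x_bar) ** 2 for x in xs))
           * math.sqrt(sum((y - y_bar) ** 2 for y in ys)))
    return num / den

def ols_slope(xs, ys):
    """Slope of the least-squares line of y on x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return num / sum((x - x_bar) ** 2 for x in xs)

x = [-2, -1, 0, 1, 2]

cases = {
    # Perfect correlation, but Y barely changes as X changes.
    "flat": [0.01 * v for v in x],
    # Y changes a lot per unit of X (slope 10), but with large scatter.
    "noisy": [40, -70, 0, -50, 80],
    # Y is fully determined by X (Y = X^2), yet there is no linear trend.
    "curve": [v ** 2 for v in x],
}

results = {name: (pearson_r(x, ys), ols_slope(x, ys)) for name, ys in cases.items()}
for name, (r, slope) in results.items():
    print(f"{name}: r = {r:.2f}, slope = {slope:.2f}")
```

In this toy data the "flat" case has \(r\) of about 1 with a slope of 0.01, the "noisy" case has \(r\) of roughly 0.25 with a slope of 10, and the "curve" case has \(r = 0\).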
weak correlation: values for \(X\) and \(Y\) do not cluster along line
strong correlation: values for \(X\) and \(Y\) cluster strongly along a line
strength of correlation does not fully determine the slope of line describing \(X,Y\) relationship
effect size: this is the slope of the line describing the \(X,Y\) relationship. The larger the effect, the steeper the slope
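One way to make the link between strength and slope explicit, assuming the line is the least-squares line of \(Y\) on \(X\) (the notes do not say how the line is fit). The symbols \(b\) for the slope and \(s_x\), \(s_y\) for the sample standard deviations of \(X\) and \(Y\) are introduced here for illustration:

\[b = r \cdot \frac{s_y}{s_x}\]

The same \(r\) can therefore go with a steep or a shallow slope, depending on how spread out \(Y\) is relative to \(X\).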
How do we know a correlation is systematic?
If you look at enough possible sets of variables, you might find a strong correlation purely by chance
[Arbitrary Correlations](http://www.tylervigen.com/spurious-correlations)
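A small simulation can illustrate the point. It is a sketch under stated assumptions: every variable is independent random noise, so any correlation found is due to chance alone; the sample size, number of candidate variables, and seed are arbitrary choices.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation from the definition."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = (math.sqrt(sum((x - x_bar) ** 2 for x in xs))
           * math.sqrt(sum((y - y_bar) ** 2 for y in ys)))
    return num / den

random.seed(0)        # fixed seed so the run is repeatable
n_obs = 10            # observations per variable
n_candidates = 1000   # unrelated variables to search through

target = [random.gauss(0, 1) for _ in range(n_obs)]
# Every candidate is pure noise, generated independently of the target.
candidates = [[random.gauss(0, 1) for _ in range(n_obs)]
              for _ in range(n_candidates)]

correlations = [pearson_r(target, c) for c in candidates]
strongest = max(correlations, key=abs)
print(f"strongest correlation among {n_candidates} unrelated variables: {strongest:.2f}")
```

With only 10 observations per variable, searching through this many candidates will typically turn up at least one that looks strongly correlated with the target, even though none is truly related to it.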
Field of statistics investigates properties of chance events (stochastic processes):
statistical significance:
An indication of how likely it is that the correlation we observe could have happened purely by chance.
higher degree of statistical significance indicates correlation is less likely to have happened by chance
\(p\) value:
A numerical measure of statistical significance. Puts a number on how likely a correlation at least as strong as the one observed would be to occur by chance, assuming we know the chance procedure and the truth is a \(0\) correlation.
It is a probability, so is between \(0\) and \(1\).
Lower \(p\)-values indicate greater statistical significance
\(p < 0.05\) often used as threshold for "significant" result.
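One way to put a number on "by chance" is a permutation test, sketched below. This is an assumption about the chance procedure, not necessarily the one used in the course: shuffling \(Y\) breaks any link to \(X\), so the true correlation among shuffled data is \(0\). The function name `permutation_p_value`, the data, and the number of shuffles are all illustrative.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation from the definition."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = (math.sqrt(sum((x - x_bar) ** 2 for x in xs))
           * math.sqrt(sum((y - y_bar) ** 2 for y in ys)))
    return num / den

def permutation_p_value(xs, ys, n_shuffles=2000, seed=0):
    """Share of shuffles whose |r| is at least as large as the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # break any real X-Y link
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Add 1 to both counts so the estimate is never exactly 0.
    return (hits + 1) / (n_shuffles + 1)

x = list(range(1, 11))
y_strong = [1.2, 1.9, 3.4, 3.8, 5.1, 6.3, 6.8, 8.2, 8.9, 10.1]  # tracks x closely
y_weak = [5, 1, 8, 3, 9, 2, 7, 4, 10, 6]                         # loosely related to x

p_strong = permutation_p_value(x, y_strong)
p_weak = permutation_p_value(x, y_weak)
print(f"strong pattern: p = {p_strong:.4f}")
print(f"weak pattern:   p = {p_weak:.4f}")
```

The strong pattern is almost never matched by shuffled data, so its \(p\)-value falls well below \(0.05\); the weak pattern is matched often, so its \(p\)-value stays above the threshold.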
\(p\) value:
Be wary of "\(p\)-hacking"
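A rough sketch of why this matters: if an analyst tries many outcome variables that are truly unrelated to \(X\), about 1 in 20 will still clear \(p < 0.05\). The simulation assumes normally distributed noise and uses the standard table cutoff of roughly \(|r| > 0.444\) for \(p < 0.05\) (two-sided) at \(N = 20\); the variable names and seed are illustrative.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation from the definition."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = (math.sqrt(sum((x - x_bar) ** 2 for x in xs))
           * math.sqrt(sum((y - y_bar) ** 2 for y in ys)))
    return num / den

random.seed(1)      # fixed seed so the run is repeatable
n_obs = 20          # observations per variable
n_outcomes = 1000   # unrelated outcome variables the analyst tries
# Assumed cutoff: for N = 20, |r| above about 0.444 corresponds to p < 0.05.
R_CRITICAL = 0.444

x = [random.gauss(0, 1) for _ in range(n_obs)]
false_positives = 0
for _ in range(n_outcomes):
    y = [random.gauss(0, 1) for _ in range(n_obs)]  # unrelated to x by construction
    if abs(pearson_r(x, y)) > R_CRITICAL:
        false_positives += 1

share = false_positives / n_outcomes
print(f"share of unrelated outcomes that look 'significant': {share:.3f}")
```

Reporting only the outcomes that cleared the threshold, without saying how many were tried, makes chance results look systematic.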
| Statistical Significance | \(p\)-value | By Chance? | Why? | "Real"? |
|---|---|---|---|---|
| Low | High (\(p > 0.05\)) | Likely | small \(N\), weak correlation | Probably not |
| High | Low (\(p < 0.05\)) | Unlikely | large \(N\), strong correlation | Probably |